Introduction and data info
This document shows a worked example of analyzing 24-hour
accelerometry data, using a subset of the open-access Long-term Movement
Monitoring Database. I’ve already summarized the raw 100 Hz
acceleration data and summarized it in one-minute epochs as “MIMS units”
via the MIMSunit R package (link here). MIMS
units are very similar to “counts” available from Actigraph devices, but
they have some technical and practical advantages. Check out the
paper for more on MIMS units.
The code and data accompanying this demo can be found at GitHub
here.
I made this demo to accompany my talk for the 2023 Mathematical
Sciences in Obesity Research short course. This is an
educational analysis meant to be quick and feasible on a
lower-end laptop, which put some hard constraints on how much and how
complex of a dataset we can work with. It also skips over some of the
finer points that really do matter in research studies, like what to do
about non-wear-time and other sources of missingness in your data. Also,
in part because of the small size of the dataset, we don’t get
particularly interesting or insightful results. Repeat this analysis on
the entire NHANES cohort and it’d probably be a different story.
The present dataset consists of waist-worn accelerometer data from
one 24-hour day (midnight to midnight) from 50 older adults. These older
adults were classified as “fallers” (people who had a history of falls)
and “controls” (no history of falls). This is a subset of the larger
LTMMD; I preprocessed only subjects with a reasonable amount of wear
time in their first 24-hour day of device wear. The full dataset is
larger (~80 subjects) and has multiple days of data from each
subject.
Main goals
This educational demo is intended to showcase a few aspects of
analyzing 24-hour accelerometer data:
- The right-skewed distribution typically seen in both aggregate and
minute-by-minute accelerometer data
- Potential strategies for modeling continuous and categorical
predictors of accelerometer-measured physical activity
- How to include smooth, nonlinear predictors in models of activity
data
- A functional data analysis (FDA) approach to modeling 24-hour
activity
Read in data
As noted above, the data are already preprocessed into one-minute
epochs. This substantially reduces the size of the data we need to
download. Notice how for each subject we’re calculating a
total_MIMS variable that is just a sum of all of their
accumulated MIMS units throughout the day.
library(tidyverse)
library(viridis) #pretty colors
library(mgcv) #for semiparametric regression
library(refund) #for functional regression
library(reshape2) #just for melt
library(gratia) #Viewing mgcv model results
# --- Load in our files and main dataframe ----
main_df <- read_csv("data/main_df.csv", show_col_types = FALSE)
all_files <- list.files("data/day_mims/", pattern="*.csv")
#Preallocate
df_list <- list()
day_list <- list()
subject_df_list <- list()
Y <- matrix(data=NA,nrow=length(all_files),ncol=1440) #1440 minutes in a day
for (i in 1:length(all_files)){
f <- all_files[i]
this_subject <- gsub("\\_mims.csv$", "", f)
this_df <- read_csv(paste("data/day_mims/", f, sep=''), show_col_types = FALSE) %>%
mutate(hours = seq(0,24,length.out=1440),
sub_code = this_subject)
#Total activity
this_subject_df <- main_df %>% filter(sub_code == this_subject) %>%
mutate(total_mims = sum(this_df$MIMS_UNIT))
day_list[[i]] <- this_df
Y[i,] <- this_df$MIMS_UNIT #Drop in 24hr activity data
df_list[[i]] <- this_subject_df #Slightly untidy but works
}
scalar_df <- bind_rows(df_list)
day_df <- bind_rows(day_list)
Examine one day
Here’s a pretty typical day from the dataset:
ix <- 2 #Which subject?
hour_seq <- seq(0, 24, by = 3)
formatted_hours <- sprintf("%02d:00", hour_seq)
ggplot(mapping = aes(x=seq(0,24,length.out=1440),
y = Y[ix,])) +
geom_ribbon(aes(ymax = Y[ix,], ymin=0), fill = "navy", alpha = 0.4) +
geom_line(linewidth = 0.5, color = "navy") +
ggtitle("24hrs of activity data, 1min summary") +
scale_x_continuous(limits=c(0,24), breaks = hour_seq, expand = c(0.01,0),
labels = formatted_hours,
name = 'Time of day') +
scale_y_continuous(name = 'Activity level (MIMS units)') +
theme(plot.title = element_text(hjust=0.5))

And here’s data from several subjects:
plt_subs <- day_df$sub_code %>% unique() %>% head(9)
day_df %>%
filter(sub_code %in% plt_subs) %>%
ggplot(aes(x=hours, y=MIMS_UNIT, group = sub_code)) +
geom_ribbon(aes(ymax = MIMS_UNIT, ymin=0), fill = "navy", alpha = 0.4) +
geom_line(linewidth = 0.25, color="navy") +
facet_wrap(~sub_code, ncol=3)

It’s also worth looking at the histogram of MIMS units throughout the
day:
ggplot(mapping = aes(x=Y[ix,])) +
geom_histogram(aes(y = after_stat(density)), colour="black", fill="navy",
alpha = 0.2, bins = 25) +
geom_density(color = "navy", linewidth=1) +
ggtitle("Distribution of activity level during the day") +
scale_x_continuous(name = "Activity level (MIMS units)")

It’s very right-skewed! This can be modestly improved with a
log(x+1) transform (+1 because there are zero values).
ggplot(mapping = aes(x=log(Y[ix,] + 1))) +
geom_histogram(aes(y = after_stat(density)), colour="black", fill="navy",
alpha = 0.2, bins = 25) +
geom_density(color = "navy", linewidth=1) +
ggtitle("Distribution of activity level during the day") +
scale_x_continuous(name = "Log(activity + 1)")

Aggregating daily activity
A simple way to summarize someone’s activity level is to just add up
their activity across the day, as we did above. Here’s the distribution
of total MIMS units, at the subject level:
scalar_df %>%
ggplot(aes(x=total_mims)) +
geom_histogram(aes(y = after_stat(density)), colour="black", fill="navy",
alpha = 0.2, bins = 15) +
geom_density(color = "navy", linewidth=1) +
ggtitle("Sum of daily activity (N=50 subjects)") +
scale_x_continuous(name = "Total MIMS units")

Again, the situation improves a bit with a log transform (no plus one
here since there are no zeros).
scalar_df %>%
ggplot(aes(x=log(total_mims))) +
geom_histogram(aes(y = after_stat(density)), colour="black", fill="navy",
alpha = 0.2, bins = 15) +
geom_density(color = "navy", linewidth=1) +
ggtitle("Sum of daily activity (N=50 subjects)") +
scale_x_continuous(name = "log(Total MIMS units)")

From a modeling situation either modeling total_mims or
log(total_mims) is likely justifiable. Here’s a simple
model testing to see if there is a (smooth, possibly nonlinear)
association between age and total activity level.
#k is number of knots for the spline - with REML penalization, we just need "enough"
# - too many is not really a big concern
k <- 7 # an approximate rule for k is sqrt() of number of unique values, here 49
age_model <- gam(total_mims ~ s(age, bs="cr", k=k),
method = "REML",
data = scalar_df)
summary(age_model) #p = 0.049, I promise I didn't p-hack this result!
##
## Family: gaussian
## Link function: identity
##
## Formula:
## total_mims ~ s(age, bs = "cr", k = k)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2840.6 152.7 18.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age) 1 1 4.067 0.0493 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0589 Deviance explained = 7.81%
## -REML = 405.99 Scale est. = 1.1664e+06 n = 50
gratia::draw(age_model) #Using Gavin Simpson's nice GAM viz tool

So, we have weak evidence that people get less active with age (not
that surprising). In this case it does seem like a linear fit would be
sufficient (though we didn’t formally test for it), as evidenced in part
by the edf of 1 for the smooth term for age. With larger
datasets this will not generally be the case, though. We could’ve fit to
log(total_mims); it improves the \(R^2\) value a bit, but doesn’t make a huge
difference.
In any case, here’s a linear fit to the data:
scalar_df %>%
ggplot(aes(x=age, y=total_mims)) +
geom_point() +
geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'

Analyzing the 24-hour activity cycle
A more sophisticated analysis might involve looking at how 24-hour
activity changes with age. This, too, can be modeled; we need a
functional data model to do so. Below, I use the functional
regression framework introduced by Sonja Greven and Fabian Scheipl
in the refund package, which extends the scalar GAM from
above to work with functional data. Here, our “function” of interest is
the 24-hour activity curve that we plotted earlier. Each subject’s
24-hour day is one observation, and we can study how that observation
changes as a function of covariates like age or whether this subject has
a history of falls.
Greven and Schiepl’s framework is very clever: it reframes functional
regression as scalar regression, with the residuals being i.i.d.
conditional on the model. That just means you need to model all of the
sources of non-iid variation in the data, and your model will be
correctly specified. The biggest change from the scalar model is the
introduction of smooth residuals, \(E_i(t)\), which are modeled as random
functional effects. So the model fit below is:
\(y = ...\)
The refund interface is very similar to
mgcv, given that they’re built on top of each other. Here’s
code that fits a functional version of the same model from above, to
study how the 24hr activity changes as a function of age.
For simplicity, we consider only a constant linear effect of age
(~ c(age)), though remember on the log scale this implies
multiplicative effects. With more subjects, it’d be reasonable to look
at a smooth linear functional effect (~ age), and with lots
of subjects you could even do a smooth nonlinear functioanl effect
(~ s(age)).
For a more interpretable intercept, notice that we’re rescaling age
to age5_centered: now, an age of “0” corresponds to being
75 years old, and a one unit change in age is a 5-year increase in age.
This just helps make the intercept term \(\beta_0(t)\) more easily-interpretable:
it’s the expected 24hr activity level for a 75-year-old (after
log(x+1) transformation of course).
This model takes a little while to fit–the biggest computational cost
is the spline basis for the smooth random effects.
#Warning - takes a while to run!
#for now
scalar_df$age5_centered <- (scalar_df$age - 75)/5
#Standardize to 75 yr old, 5yr change = one unit change for \beta (t)
scalar_df$sub_code <- factor(scalar_df$sub_code) #mgcv requires factors for bs="re"
Y_lp1 <- log(Y + 1) # Log(x+1) transform, better for Gaussian link function
y_ix <- seq(0,24, length.out = 1440) #Time index, hours (0-24)
k_int <- 24 #knots for global intercept - try 16-24
k_beta <- 64 #knots for functional effect including random effects - try 48-64
# ^^^ k_beta is the biggest computational cost in the model. I think it's ~O(n^2) or even ^3
# however k_beta must be big enough for smooth residuals to fit each curve well
# so it's very important for the model to have enough df to fit each curve accurately
ctrl <- mgcv::gam.control(trace = TRUE) # verbose=1 to track fitting
#Note use of bs='cp' to induce cyclic continuity, since this is 24hr data
# (induces a warning about marginal spline reparameterization)
mod_fx <- pffr(Y_lp1 ~ c(age5_centered) + s(sub_code, bs="re"),
data = scalar_df,
yind = y_ix, #Vector of time index, from 0 to 24 hours
bs.yindex = list(bs="cp", k=k_beta, m=c(2,1)), #k cubic P-splines
bs.int = list(bs = "cp", k=k_int, m=c(2,1)),
algorithm = "bam", #big additive model mode, for speed
method = "fREML", #fast REML approximation, for speed
control = ctrl,
discrete = TRUE) #use discrete approximation, for speed
## Setup complete. Calling fit
## Fit complete. Finishing gam object.
## user.self sys.self elapsed
## initial 0.17 0.02 0.19
## gam.setup 20.54 0.25 20.78
## pre-fit 0.00 0.00 0.00
## fit 1025.71 3.40 1029.27
## finalise 0.00 0.00 0.00
summary(mod_fx, re.test = FALSE) #re.test slows down summary, not needed here
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Y_lp1 ~ c(age5_centered) + s(sub_code, bs = "re")
##
## Constant coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.61230 0.01916 31.958 < 2e-16 ***
## age5_centered -0.04612 0.01657 -2.783 0.00538 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Smooth terms & functional coefficients:
## edf Ref.df F p-value
##
## R-sq.(adj) = 0.63 Deviance explained = 64.6%
## fREML score = 63022 Scale est. = 0.29327 n = 72000 (50 x 1440)
plot(mod_fx, pages=1)

We can visually inspect some subject-specific curves to see how well
the model is fitting the curves:
#Hack the underlying mgcv::gam to get the partial effects
pred_df <- predict(mod_fx) %>% as.data.frame() %>%
pivot_longer(everything(), names_prefix = "V", values_to = "yhat",
names_to = "sample") %>%
mutate(sample = as.numeric(sample),
hours = (sample-1)/1439*24) %>% #care, off-by-one
mutate(yhat_raw = exp(yhat) - 1) %>% #transform back to raw scale
mutate(sub_code = rep(scalar_df$sub_code, each = 1440))
#Pick 9 subjects to plot
plot_pred_df <- pred_df %>%
filter(sub_code %in% plt_subs)
day_df %>%
filter(sub_code %in% plt_subs) %>%
ggplot(aes(x=hours, y=MIMS_UNIT, group = sub_code)) +
geom_ribbon(aes(ymax = MIMS_UNIT, ymin=0), fill = "navy", alpha = 0.2) +
geom_line(linewidth = 0.25, color="navy", alpha = 0.4) +
geom_line(aes(y=yhat_raw), data = plot_pred_df,
color = "red", linewidth = 1, alpha = 0.6) +
facet_wrap(~sub_code) +
scale_x_continuous(limits=c(0,24), breaks = hour_seq, expand = c(0.02,0))

We can also check out the fit on the “linear predictor” scale
(log(x+1))
day_df %>%
mutate(log_p1_mims = log(MIMS_UNIT + 1)) %>%
filter(sub_code %in% plt_subs) %>%
ggplot(aes(x=hours, y=log_p1_mims, group = sub_code)) +
geom_ribbon(aes(ymax = log_p1_mims, ymin=0), fill = "navy", alpha = 0.2) +
geom_line(linewidth = 0.25, color="navy", alpha = 0.4) +
geom_line(aes(y=yhat), data = plot_pred_df,
color = "red", linewidth = 1, alpha = 0.6) +
facet_wrap(~sub_code) +
scale_x_continuous(limits=c(0,24), breaks = hour_seq, expand = c(0.02,0))

Now let’s take a look at the functional effects:
#Hacking the underlying GAM to get raw predictions
#y_ix.vec name is a consequence of passing variable y_ix earlier.
#If we used t, it would be called t.vec
int_df <- data.frame(y_ix.vec = seq(0,24,length.out = 1440)) %>%
mutate(sub_code = factor("CO002"),
age5_centered = 1)
#Will ignore subject when constructing intercept, needed to dodge predict warning
#iterms to include intercept uncertainty
beta_preds <- mgcv::predict.gam(mod_fx, type="iterms",
newdata = int_df, se.fit = TRUE)
#Then do response for yhat on lp1 scale
smooth_select <- paste("s(y_ix.vec):", "c(age5_centered)", sep="")
smooth_select <- "age5_centered" #if c(age5_centered)
y_constant <- attr(beta_preds, "constant")
y_functional <- beta_preds$fit[,"s(y_ix.vec)"]
y_functional_se <- beta_preds$se.fit[,"s(y_ix.vec)"]
beta_t <- beta_preds$fit[,smooth_select]
#LP - linear predictor + 95% CIs
y_LP <- y_constant + y_functional
y_LP_low <- y_constant + y_functional - 1.96*y_functional_se
y_LP_hi <- y_constant + y_functional + 1.96*y_functional_se
#Also hack-ish, would be better with tidyfun:: methods but would introduce more dependencies
beta_df <- data.frame(hours = seq(0,24,length.out=1440),
y_LP_65 = y_constant + y_functional + -2*beta_t,
y_LP_70 = y_constant + y_functional + -1*beta_t,
y_LP_75 = y_constant + y_functional + 0*beta_t,
y_LP_80 = y_constant + y_functional + 1*beta_t,
y_LP_85 = y_constant + y_functional + 2*beta_t,
y_LP = y_constant + y_functional,
y_LP_low = y_constant + y_functional - 1.96*y_functional_se,
y_LP_hi = y_constant + y_functional + 1.96*y_functional_se) %>%
mutate(yhat_raw = exp(y_LP) - 1,
yhat_raw_low = exp(y_LP_low) - 1,
yhat_raw_hi = exp(y_LP_hi) - 1
) %>%
mutate(y_65 = exp(y_LP_65) - 1,
y_70 = exp(y_LP_70) - 1,
y_75 = exp(y_LP_75) - 1,
y_80 = exp(y_LP_80) - 1,
y_85 = exp(y_LP_85) - 1)
beta_df %>%
ggplot(aes(x=hours, y=yhat_raw)) +
geom_line(color = "navy") +
geom_ribbon(aes(ymin = yhat_raw_low, ymax = yhat_raw_hi),
fill = "navy", alpha = 0.2) +
scale_x_continuous(limits=c(0,24), breaks = hour_seq, expand = c(0.01,0),
labels = formatted_hours,
name = 'Time of day') +
ggtitle("Estimated average daily activity pattern for a 75 yr old")

Plotting age estimates:
lwd <- 2
age_col <-viridis(5, direction = 1) #5 colors for 65-85
beta_df %>%
ggplot(aes(x=hours, y=yhat_raw)) +
geom_line(aes(y=y_65), color = age_col[1], linewidth = lwd) +
geom_line(aes(y=y_70), color = age_col[2], linewidth = lwd) +
geom_line(aes(y=y_75), color = age_col[3], linewidth = lwd) +
geom_line(aes(y=y_80), color = age_col[4], linewidth = lwd) +
geom_line(aes(y=y_85), color = age_col[5], linewidth = lwd) +
geom_point(aes(color = hours), alpha = 0) +
scale_color_viridis_c(option="viridis", limits = c(65,85),
breaks = c(65, 70, 75, 80, 85),
labels = c('65', '70', '75', '80', '85')) +
guides(color = guide_colorbar(title = "Age (yrs)")) +
scale_x_continuous(limits=c(0,24), breaks = hour_seq, expand = c(0.01,0),
labels = formatted_hours,
name = 'Time of day') +
ggtitle("Estimated effect of age on 24-hr activity") +
theme(legend.position = "bottom")

Possible avenues for improvement
- Using repeated days within subjects (raw dataset has 2-3 days per
person). Would need a subject-level random effect as well as a
curve-level random effect.
- Intelligently dealing with missing data, either through sparse FDA
or other methods
- Trying a poisson, beta, or other extended-family regression model to
deal with zero-inflation
- Using principal component smooth residuals instead of P-spline basis
residuals (
pcre() terms in pffr) - might fit
much faster since dimensionality of the smooth residuals is the
computational bottleneck
Some residual diagnosics
We can check how “good” our iid assumption is by plotting the
residual functions and the residual correlation matrix. A
perfectly-specified pffr model should have residuals that
look like pure white noise centered around zero, and should have a
residual correllation matrix that is just the identity matrix (all
off-diagonal correlations = 0)
I suspect the worse-ish autocorrelation in the upper left (remember,
that’s midnight to ~6am) is from some–but not all–subjects taking the
device off at night, creating a perfect flat line. As above, that might
be addressed by a cleverer transformation than
log(x+1).
resid_E = residuals(mod_fx)
R_E <- cor(resid_E)
matplot(t(resid_E), type="l")
title("Residual functions for original model")

melted_R <- melt(R_E)
ggplot(data = melted_R, aes(x=Var2, y=Var1, fill=value)) +
geom_raster() +
scale_x_continuous(expand = c(0,0)) +
scale_y_reverse(expand = c(0,0)) +
scale_fill_gradientn(colors = c("#e41a1c", "white", "#377eb8"),
limits = c(-1, 1),
breaks = c(-1, 0, 1)) +
ggtitle('Residual autocorrelation in original model')

Fit an intentionally-misspecified model and compare
If we skip the subject/curve-specific random effects (one and the
same since we have one day from each subject) we misspecify the model
and should have a lot more residual autocorrelation. Do we?
mod_mis <- pffr(Y_lp1 ~ c(age5_centered),
data = scalar_df,
yind = y_ix, #Vector of time index, from 0 to 24 hours
bs.yindex = list(bs="cp", k=k_beta, m=c(2,1)), #k cubic P-splines
bs.int = list(bs = "cp", k=k_int, m=c(2,1)),
algorithm = "bam",
method = "fREML", #fast REML approximation
control = ctrl,
discrete = TRUE) #use discrete approximation for speed
## Setup complete. Calling fit
## Fit complete. Finishing gam object.
## user.self sys.self elapsed
## initial 0.03 0 0.03
## gam.setup 0.00 0 0.00
## pre-fit 0.00 0 0.00
## fit 0.01 0 0.02
## finalise 0.02 0 0.01
#summary(mod_mis, re.test = FALSE)
#plot(mod_mis, pages=1)
resid_M = residuals(mod_mis)
R_M <- cor(resid_M)
melted_RM <- melt(R_M)
matplot(t(resid_M), type="l")
title("Residual functions for misspecified model")

ggplot(data = melted_RM, aes(x=Var2, y=Var1, fill=value)) +
geom_raster() +
scale_x_continuous(expand = c(0,0)) +
scale_y_reverse(expand = c(0,0)) +
scale_fill_gradientn(colors = c("#e41a1c", "white", "#377eb8"),
limits = c(-1, 1),
breaks = c(-1, 0, 1)) +
ggtitle('Residual autocorrelation in misspecified model')

Much worse! The lack of smooth residuals results in a lot of residual
autocorrellation and very non-iid residual functions.